BiasBusters: Uncovering and Mitigating Tool Selection Bias in Large Language Models
Blankenstein, Thierry, Yu, Jialin, Li, Zixuan, Plachouras, Vassilis, Sengupta, Sunando, Torr, Philip, Gal, Yarin, Paren, Alasdair, Bibi, Adel
Agents backed by large language models (LLMs) often rely on external tools drawn from marketplaces where multiple providers offer functionally equivalent options. This raises a critical point concerning fairness: if selection is systematically biased, it can degrade user experience and distort competition by privileging some providers over others. We introduce a benchmark of diverse tool categories, each containing multiple functionally equivalent tools, to evaluate tool-selection bias. Using this benchmark, we test seven models and show that unfairness exists, with models either fixating on a single provider or disproportionately preferring earlier-listed tools in context. To investigate the origins of this bias, we conduct controlled experiments examining tool features, metadata (name, description, parameters), and pre-training exposure. We find that: (1) semantic alignment between queries and metadata is the strongest predictor of choice; (2) perturbing descriptions significantly shifts selections; and (3) repeated pre-training exposure to a single endpoint amplifies bias. Finally, we propose a lightweight mitigation that first filters the candidate tools to a relevant subset and then samples uniformly, reducing bias while preserving good task coverage. Our findings highlight tool-selection bias as a key obstacle to the fair deployment of tool-augmented LLMs.

Large language models (LLMs) have transformed natural language processing, achieving near-human performance on tasks ranging from code generation to creative writing (Naveed et al., 2024; Luo et al., 2024). Yet LLMs cannot directly act in the world: they cannot query databases, fetch live information, or invoke external services. Additionally, their knowledge remains frozen at training time, leaving them prone to "hallucinations" when asked about events beyond their cutoff (Ji et al., 2023).
Augmenting LLMs with external "tools" / APIs addresses these shortcomings by allowing models to delegate specialized functions to dedicated services (Qu et al., 2025). It endows LLMs with the ability to act, a core capability often associated with LLM agents (Chowa et al., 2025). A crucial step within the typical tool-usage pipeline is the multi-stage tool-selection process: given a user instruction, (i) retrieve a short list of the most relevant candidate tools (e.g., those with the highest semantic similarity to the query) from a potentially large database of tools, (ii) insert their metadata into the prompt, and (iii) have the LLM reason over the candidates and pick one to solve the user's task. However, this process introduces a new challenge: bias (see Figure 1).
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
- Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (0.86)
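The mitigation proposed in the BiasBusters abstract above (filter the candidates to a relevant subset, then sample uniformly so no provider is systematically privileged) can be sketched as follows. This is a minimal illustration: the keyword-overlap relevance score is a hypothetical stand-in for whatever semantic retriever the agent pipeline actually uses, and the tool names are invented.

```python
import random

def select_tool(query_terms, tools, threshold=1, rng=None):
    """Filter candidate tools to those relevant to the query, then pick
    one uniformly at random so no single provider is privileged.

    `tools` maps a tool name to its metadata keywords; the keyword-overlap
    relevance score is a placeholder for a real semantic retriever.
    """
    rng = rng or random.Random(0)
    relevant = [
        name for name, keywords in tools.items()
        if len(set(query_terms) & set(keywords)) >= threshold
    ]
    if not relevant:
        return None
    return rng.choice(relevant)  # uniform over the relevant subset

# Hypothetical marketplace with two functionally equivalent weather tools
tools = {
    "weather_a": {"weather", "forecast"},
    "weather_b": {"weather", "temperature"},
    "stocks_x": {"stocks", "price"},
}
choice = select_tool({"weather", "today"}, tools)
```

Because the final choice is sampled rather than left to the LLM, position in the prompt and provider identity cannot influence it; only the relevance filter does.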
Man develops psychosis following ChatGPT's salt-free diet
Reducing salt intake is often a solid way to improve your overall health. However, swapping out classic sodium chloride for sodium bromide is a solid way to give yourself acne, involuntary muscle spasms, and paranoid psychosis. Knowing this, it's probably best to avoid that chemical compound entirely, even if ChatGPT tells you otherwise. In a recent case, one patient who was allegedly following the generative AI's nutritional suggestion was placed under a hospital's involuntary psychiatric hold for three weeks.
A ChatGPT-based approach for questions generation in higher education
Vu, Sinh Trong, Truong, Huong Thu, Do, Oanh Tien, Le, Tu Anh, Mai, Tai Tan
Large language models have been widely applied in many aspects of real life, bringing significant efficiency to businesses and offering distinctive user experiences. In this paper, we focus on exploring the application of ChatGPT, a chatbot based on a large language model, to support higher-education instructors in generating quiz questions and assessing learners. Specifically, we explore interactive prompting patterns to design an optimal AI-powered question bank creation process. The generated questions are evaluated through a "blind test" survey sent to various stakeholders, including lecturers and learners. Initial results at the Banking Academy of Vietnam are relatively promising, suggesting a potential direction to streamline the time and effort involved in assessing learners at higher-education institutes.
- Asia > Vietnam > Hanoi > Hanoi (0.05)
- South America > Uruguay > Maldonado > Maldonado (0.04)
- North America > United States > Minnesota (0.04)
- Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
- Instructional Material > Course Syllabus & Notes (0.68)
- Research Report > New Finding (0.47)
Comparing Human and AI Rater Effects Using the Many-Facet Rasch Model
Jiao, Hong, Song, Dan, Lee, Won-Chan
Large language models (LLMs) have been widely explored for automated scoring in low-stakes assessment to facilitate learning and instruction. Empirical evidence about which LLM produces the most reliable scores and induces the fewest rater effects needs to be collected before LLMs are used for automated scoring in practice. This study compared ten LLMs (ChatGPT 3.5, ChatGPT 4, ChatGPT 4o, OpenAI o1, Claude 3.5 Sonnet, Gemini 1.5, Gemini 1.5 Pro, Gemini 2.0, DeepSeek V3, and DeepSeek R1) with human expert raters in scoring two types of writing tasks. The accuracy of the holistic and analytic scores from LLMs compared with human raters was evaluated in terms of Quadratic Weighted Kappa. Intra-rater consistency across prompts was compared in terms of Cronbach's Alpha. Rater effects of LLMs were evaluated and compared with human raters using the Many-Facet Rasch model. The results in general supported the use of ChatGPT 4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet, which showed high scoring accuracy, better rater reliability, and fewer rater effects.
- North America > United States > New Jersey > Bergen County > Mahwah (0.04)
- North America > United States > Illinois > Cook County > Chicago (0.04)
- North America > United States > Oregon > Washington County > Beaverton (0.04)
- (4 more...)
- Education > Assessment & Standards (0.95)
- Education > Educational Setting (0.68)
- Education > Educational Technology > Educational Software > Computer-Aided Assessment (0.56)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.36)
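Quadratic Weighted Kappa, the agreement metric the rater study above uses to compare LLM and human scores, can be computed directly from the two raters' score vectors. A minimal sketch for integer score scales 0..n_levels-1 (the standard definition, not code from the study):

```python
from collections import Counter

def quadratic_weighted_kappa(rater_a, rater_b, n_levels):
    """Cohen's kappa with quadratic weights for ordinal scores 0..n_levels-1.

    Disagreements are penalized by the squared distance between scores,
    normalized by the maximum possible squared distance.
    """
    n = len(rater_a)
    observed = Counter(zip(rater_a, rater_b))  # joint score counts
    hist_a = Counter(rater_a)                  # marginal counts, rater A
    hist_b = Counter(rater_b)                  # marginal counts, rater B
    num = den = 0.0
    for i in range(n_levels):
        for j in range(n_levels):
            w = (i - j) ** 2 / (n_levels - 1) ** 2
            num += w * observed[(i, j)]            # observed disagreement
            den += w * hist_a[i] * hist_b[j] / n   # chance disagreement
    return 1.0 - num / den

# Perfect agreement yields kappa = 1.0
assert quadratic_weighted_kappa([0, 1, 2], [0, 1, 2], 3) == 1.0
```

Values near 1 indicate near-perfect agreement, 0 indicates chance-level agreement, and negative values indicate systematic disagreement.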
A closer look at how large language models trust humans: patterns and biases
As large language models (LLMs) and LLM-based agents increasingly interact with humans in decision-making contexts, understanding the trust dynamics between humans and AI agents becomes a central concern. While considerable literature studies how humans trust AI agents, it is much less understood how LLM-based agents develop effective trust in humans. LLM-based agents likely rely on some form of implicit effective trust in trust-related contexts (e.g., evaluating individual loan applications) to assist and affect decision making. Using established behavioral theories, we develop an approach that studies whether LLM trust depends on the three major trustworthiness dimensions: competence, benevolence, and integrity of the human subject. We also study how demographic variables affect effective trust. Across 43,200 simulated experiments spanning five popular language models and five different scenarios, we find that LLM trust development shows an overall similarity to human trust development. We find that in most, but not all, cases, LLM trust is strongly predicted by trustworthiness, and in some cases also biased by age, religion, and gender, especially in financial scenarios. This is particularly true for scenarios common in the literature and for newer models. While the overall patterns align with human-like mechanisms of effective trust formation, different models exhibit variation in how they estimate trust; in some cases, trustworthiness and demographic factors are weak predictors of effective trust. These findings call for a better understanding of AI-to-human trust dynamics and for monitoring of biases and trust-development patterns to prevent unintended and potentially harmful outcomes in trust-sensitive applications of AI.
- Asia > Middle East > Israel > Jerusalem District > Jerusalem (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- Health & Medicine (1.00)
- Banking & Finance (0.66)
LegalScore: Development of a Benchmark for Evaluating AI Models in Legal Career Exams in Brazil
Caparroz, Roberto, Roitman, Marcelo, Chow, Beatriz G., Giusti, Caroline, Torhacs, Larissa, Sola, Pedro A., Diogo, João H. M., Balby, Luiza, Vasconcelos, Carolina D. L., Caparroz, Leonardo R., Franco, Albano P.
This research introduces LegalScore, a specialized index for assessing how generative artificial intelligence models perform on a selected range of career exams that require a legal background in Brazil. The index evaluates the performance of fourteen artificial intelligence models, from proprietary to open-source, in answering the objective questions used in these exams. The research examines how English-trained large language models respond when applied to Brazilian legal contexts, underscoring the importance of and need for Brazil-specific training data in generative artificial intelligence models. Performance analysis shows that while proprietary and better-known models achieved better results overall, local and smaller models showed promising performance owing to their Brazilian context alignment in training. By establishing an evaluation framework with metrics including accuracy, confidence intervals, and normalized scoring, LegalScore enables systematic assessment of artificial intelligence performance on legal examinations in Brazil. While the study demonstrates artificial intelligence's potential value for exam preparation and question development, it concludes that significant improvements are needed before AI can match human performance in advanced legal assessments. The benchmark creates a foundation for continued research, highlighting the importance of local adaptation in artificial intelligence development.
- South America > Brazil > São Paulo (0.04)
- South America > Brazil > Santa Catarina (0.04)
- South America > Brazil > Rio de Janeiro > Rio de Janeiro (0.04)
- (7 more...)
- Law > Statutes (1.00)
- Law > Criminal Law (0.68)
- Law > Labor & Employment Law (0.68)
- (2 more...)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.68)
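One of the metrics the LegalScore framework above mentions, a confidence interval for a model's accuracy over a fixed number of exam questions, can be sketched with the standard normal-approximation (Wald) interval. The numbers below are illustrative, not results from the study, and the z = 1.96 critical value assumes a 95% interval.

```python
import math

def accuracy_confidence_interval(correct, total, z=1.96):
    """95% normal-approximation (Wald) interval for exam accuracy,
    clipped to the valid [0, 1] range."""
    p = correct / total
    margin = z * math.sqrt(p * (1 - p) / total)
    return max(0.0, p - margin), min(1.0, p + margin)

# Hypothetical model: 72 of 100 objective questions answered correctly
low, high = accuracy_confidence_interval(72, 100)
```

With 72/100 correct this gives roughly (0.63, 0.81); the width shrinks as the number of exam questions grows, which is why per-exam sample sizes matter when ranking models.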
Advice for Diabetes Self-Management by ChatGPT Models: Challenges and Recommendations
Given their advanced reasoning, extensive contextual understanding, and robust question-answering abilities, large language models have become prominent in healthcare management research. Despite adeptly handling a broad spectrum of healthcare inquiries, these models face significant challenges in delivering accurate and practical advice for chronic conditions such as diabetes. We evaluate the responses of ChatGPT versions 3.5 and 4 to diabetes patient queries, assessing their depth of medical knowledge and their capacity to deliver personalized, context-specific advice for diabetes self-management. Our findings reveal discrepancies in accuracy and embedded biases, emphasizing the models' limitations in providing tailored advice unless activated by sophisticated prompting techniques. Additionally, we observe that both models often provide advice without seeking necessary clarification, a practice that can result in potentially dangerous advice. This underscores the limited practical effectiveness of these models without human oversight in clinical settings. To address these issues, we propose a commonsense evaluation layer for prompt evaluation and the incorporation of disease-specific external memory using an advanced Retrieval-Augmented Generation technique. This approach aims to improve information quality and reduce misinformation risks, contributing to more reliable AI applications in healthcare settings. Our findings seek to influence the future direction of AI in healthcare, enhancing both the scope and quality of its integration.
- North America > United States (0.46)
- Oceania > Australia (0.04)
- Asia > Pakistan (0.04)
Can Stories Help LLMs Reason? Curating Information Space Through Narrative
Javadi, Vahid Sadiri, Trippas, Johanne R., Lal, Yash Kumar, Flek, Lucie
Narratives are widely recognized as a powerful tool for structuring information and facilitating comprehension of complex ideas in various domains such as science communication. This paper investigates whether incorporating narrative elements can assist Large Language Models (LLMs) in solving complex problems more effectively. We propose a novel approach, Story of Thought (SoT), integrating narrative structures into prompting techniques for problem-solving. This approach involves constructing narratives around problem statements and creating a framework to identify and organize relevant information. Our experiments show that using various LLMs with SoT consistently surpasses using them with other techniques on physics, chemistry, math, and biology questions in both the GPQA and JEEBench datasets. The narrative-based information curation process in SoT enhances problem comprehension by contextualizing critical in-domain information and highlighting causal relationships within the problem space.
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- North America > Canada > Ontario > Toronto (0.04)
- North America > United States > South Carolina (0.04)
- (7 more...)
- Overview (1.00)
- Research Report > Promising Solution (0.48)
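The Story of Thought pattern described above, constructing a narrative around the problem statement before answering, can be approximated with a simple two-stage prompt template. The wording below is a hypothetical approximation for illustration, not the authors' exact prompts.

```python
def story_of_thought_prompt(problem):
    """Build a narrative-framing prompt in the spirit of Story of Thought:
    first ask for a story that organizes the relevant entities, concepts,
    and causal relationships, then ask for an answer grounded in it.
    The phrasing is illustrative, not the paper's exact template."""
    return (
        "Problem: " + problem + "\n\n"
        "Step 1: Tell a short story that introduces the entities in this "
        "problem, the relevant domain concepts, and the causal "
        "relationships between them.\n"
        "Step 2: Using only the information organized in your story, "
        "solve the problem and state the final answer."
    )

prompt = story_of_thought_prompt("Why does ice float on water?")
```

The template would then be sent to any chat-style LLM; the claim in the abstract is that the narrative stage surfaces and organizes the in-domain information the answer depends on.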
Emotion-Aware Response Generation Using Affect-Enriched Embeddings with LLMs
Rasool, Abdur, Shahzad, Muhammad Irfan, Aslam, Hafsa, Chan, Vincent
There is a need for empathetic and coherent responses in automated chatbot-facilitated psychotherapy sessions. This study addresses the challenge of enhancing the emotional and contextual understanding of large language models (LLMs) in psychiatric applications. We introduce a novel framework that integrates multiple emotion lexicons, including NRC Emotion Lexicon, VADER, WordNet, and SentiWordNet, with state-of-the-art LLMs such as LLAMA 2, Flan-T5, ChatGPT 3.0, and ChatGPT 4.0. The primary dataset comprises over 2,000 therapy session transcripts from the Counseling and Psychotherapy database, covering discussions on anxiety, depression, trauma, and addiction. We segment the transcripts into smaller chunks, enhancing them with lexical features and computing embeddings using BERT, GPT-3, and RoBERTa to capture semantic and emotional nuances. These embeddings are stored in a FAISS vector database, enabling efficient similarity search and clustering based on cosine similarity. Upon user query, the most relevant segments are retrieved and provided as context to the LLMs, significantly improving the models' ability to generate empathetic and contextually appropriate responses. Experimental evaluations demonstrate that incorporating emotion lexicons enhances empathy, coherence, informativeness, and fluency scores. Our findings highlight the critical role of emotional embeddings in improving LLM performance for psychotherapy.
- Asia > China > Jiangsu Province > Nanjing (0.04)
- North America > United States > Virginia > Alexandria County > Alexandria (0.04)
- North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
- (4 more...)
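The retrieval step in the pipeline above (embed transcript chunks, then return the segments most cosine-similar to the user query as LLM context) can be sketched in plain Python. The bag-of-words count vectors here are a toy stand-in for the BERT/GPT/RoBERTa embeddings, and the sorted scan stands in for the FAISS index; the transcript snippets are invented.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, chunks, k=2):
    """Return the k transcript chunks most similar to the query."""
    q = Counter(query.lower().split())
    ranked = sorted(
        chunks,
        key=lambda c: cosine(q, Counter(c.lower().split())),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical therapy-transcript chunks
chunks = [
    "patient reports anxiety before public speaking",
    "discussion of sleep hygiene and routines",
    "coping strategies for anxiety and panic",
]
top = retrieve("managing anxiety", chunks, k=2)
```

In the actual framework the retrieved segments (enriched with lexicon features) are prepended to the LLM prompt as context; swapping the count vectors for dense embeddings and the scan for a FAISS index changes only the representation, not the retrieval logic.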
Evaluating Research Quality with Large Language Models: An Analysis of ChatGPT's Effectiveness with Different Settings and Inputs
Evaluating the quality of academic journal articles is a time-consuming but critical task for national research evaluation exercises, appointments, and promotion. It is therefore important to investigate whether Large Language Models (LLMs) can play a role in this process. This article assesses which ChatGPT inputs (full text without tables, figures, and references; title and abstract; title only) produce better quality score estimates, and the extent to which scores are affected by ChatGPT models and system prompts. The results show that the optimal input is the article title and abstract, with average ChatGPT scores based on these (30 iterations on a dataset of 51 papers) correlating at 0.67 with human scores, the highest ever reported. ChatGPT 4o is slightly better than 3.5-turbo (0.66) and 4o-mini (0.66). The results suggest that article full texts might confuse LLM research quality evaluations, even though complex system instructions for the task are more effective than simple ones. Thus, whilst abstracts contain insufficient information for a thorough assessment of rigour, they may contain strong pointers about originality and significance. Finally, linear regression can be used to convert the model scores into human-scale scores, which is 31% more accurate than guessing.
- Oceania > New Zealand (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- Europe > United Kingdom > England > South Yorkshire > Sheffield (0.04)
- (2 more...)
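The calibration step mentioned in the abstract above, converting model scores to the human scale via linear regression, amounts to a one-variable least-squares fit. A minimal sketch with illustrative numbers (not the paper's data):

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b with one predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx  # slope, intercept

# Hypothetical model scores vs. human scores on a 1-4 quality scale
model_scores = [2.0, 2.5, 3.0, 3.5]
human_scores = [1.5, 2.5, 3.5, 4.5]
a, b = fit_line(model_scores, human_scores)

# Map a new model score onto the human scale
calibrated = a * 3.0 + b
```

Fitting on held-out human-scored papers and applying the line to new model scores is what makes the converted scores comparable to the human scale, rather than just correlated with it.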